Description

Background and Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify potential customers who have a higher probability of purchasing the loan.

Objective

Data Dictionary

1. Load libraries

2. Loading and Exploring the Data

3. EDA

The missingno (MSNO) matrix gives us information on the missingness of the data. Looking at the graph above, we can conclude that the data has no missing values at all.
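The msno matrix is a visual check; the same conclusion can be cross-checked numerically with pandas. A minimal sketch, using a small hypothetical frame in place of the actual loan data:

```python
import pandas as pd

# Hypothetical stand-in for the loan dataset (assumed columns, no NaNs)
df = pd.DataFrame({
    "Age": [25, 45, 39],
    "Income": [49, 100, 81],
    "Personal_Loan": [0, 1, 0],
})

# Count missing values per column; the msno matrix is the visual equivalent of this table
missing_per_column = df.isnull().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print("Total missing values:", total_missing)  # 0 agrees with the msno matrix reading
```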

A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering. The key to interpreting a dendrogram is to focus on the height at which any two objects are joined together.
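Those merge heights can be inspected directly from the linkage matrix that underlies the dendrogram. A minimal sketch with scipy, assuming an arbitrary small numeric matrix in place of the real features:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy feature matrix standing in for the dataset's numeric columns (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# Ward linkage; column 2 of Z holds the height at which each pair of clusters merges
Z = linkage(X, method="ward")
merge_heights = Z[:, 2]
print(merge_heights)  # non-decreasing: later merges join increasingly dissimilar groups
```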

Data Preprocessing

Outlier detection using box plots
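A box plot flags points outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] as outliers; the same rule can be applied numerically. A sketch on an assumed toy income column (the real frame would be substituted):

```python
import pandas as pd

# Toy income column with one extreme value (illustrative data, not from the dataset)
s = pd.Series([40, 42, 45, 47, 50, 52, 55, 300], name="Income")

# Points beyond 1.5 * IQR from the quartiles are the box plot's outliers
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # the extreme value 300 is flagged
```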

Model building - Logistic Regression - sklearn library
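A minimal sketch of the sklearn fit, assuming synthetic two-feature data in place of the prepared bank frame (the split ratio and random seeds are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared data: two features, imbalanced-ish target
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 1.3).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
lr = LogisticRegression()  # default L2 penalty; class_weight="balanced" is an option for imbalance
lr.fit(X_train, y_train)
test_acc = lr.score(X_test, y_test)
print(f"Test accuracy: {test_acc:.3f}")
```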

Model evaluation criterion

The model can make two kinds of wrong predictions:

  1. Predicting that a person will take a personal loan when the person is actually not interested (false positive).
  2. Predicting that a person will not take a personal loan when the person is actually interested (false negative).
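The two error cases map directly onto the off-diagonal cells of a confusion matrix. A sketch with sklearn, using assumed toy labels (1 = takes the loan):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Illustrative labels and predictions (not from the actual model)
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("False positives (case 1):", fp)  # predicted interested, actually not
print("False negatives (case 2):", fn)  # predicted not interested, actually interested
print("Recall:", recall_score(y_true, y_pred))  # fraction of real buyers we catch
```

If missing a likely buyer is the costlier mistake, recall is the metric to maximize.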

Which case is more important?

How to reduce losses?

Logistic Regression - Stats Model

But first we will have to remove multicollinearity from the data to get reliable coefficients and p-values.

Additional Information on VIF

Multicollinearity

Observations:

  1. Dropping Age or ZIPCode doesn't have a significant impact on the model performance.
  2. We can choose any model to proceed to the next steps.
  3. Here, we will go with the lg1 model - where we dropped Age
  4. Some categorical levels of a variable still have VIF > 5, but removing individual categories from a variable would hurt the interpretability of the model.

ROC - AUC
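A minimal sklearn sketch of the ROC-AUC computation, using assumed toy scores in place of the model's predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Illustrative scores: higher should mean more likely to take the loan
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
auc = roc_auc_score(y_true, y_score)
print(f"AUC: {auc:.3f}")  # 1.0 = perfect ranking, 0.5 = random guessing
```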

Converting coefficients to odds

Interpretation for other attributes can be done similarly.
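The conversion itself is just exponentiation of the logit coefficient. A sketch with a hypothetical coefficient value (not taken from the actual fit):

```python
import numpy as np

# Hypothetical logit coefficient for Income (illustrative, not the fitted value)
beta_income = 0.05

odds_ratio = np.exp(beta_income)
pct_change = (odds_ratio - 1) * 100
print(f"Odds ratio: {odds_ratio:.4f}")
print(f"A one-unit rise in Income multiplies the odds of taking the loan by "
      f"{odds_ratio:.4f} (about {pct_change:.2f}% higher odds), other variables held fixed.")
```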

Model Performance Improvement

Let's use the precision-recall curve and see if we can find a better threshold.
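One common way to pick a threshold from the curve is to maximize F1 across the candidate thresholds. A sketch on assumed toy probabilities (in the project these would come from `model.predict_proba`):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustrative probabilities and labels (not from the actual model)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.3, 0.35, 0.6, 0.2, 0.8, 0.55, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
# F1 at each candidate threshold; the final precision/recall pair has no threshold
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = thresholds[np.argmax(f1)]
print(f"Best threshold by F1: {best:.2f}")
```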

Model Performance Summary

Conclusion

Decision Tree Model

Build Decision Tree Model

Only about 9% of the samples belong to the positive class, so a model that marks every sample as negative would still achieve roughly 91% accuracy; hence accuracy is not a good metric to evaluate here.
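A quick numeric illustration of that point, using an assumed 1000-customer sample with the dataset's 9% positive rate:

```python
import numpy as np

# 1000 customers, 9% positive class (took the loan) -- mirrors the dataset's imbalance
y_true = np.array([1] * 90 + [0] * 910)
y_all_negative = np.zeros_like(y_true)  # degenerate model: predict "no loan" for everyone

accuracy = (y_true == y_all_negative).mean()
recall = (y_all_negative[y_true == 1] == 1).mean()
print(f"Accuracy: {accuracy:.2%}, Recall: {recall:.2%}")  # high accuracy, zero recall
```

High accuracy with zero recall: the degenerate model finds no buyers at all, which is why recall is the metric to watch.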

Insights:

Reducing overfitting

Using GridSearch for Hyperparameter tuning of our tree model

Recall has improved on both the train and test sets after hyperparameter tuning, and the model now generalizes well.
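A sketch of the grid search itself, assuming synthetic imbalanced data in place of the bank frame and an illustrative (not the project's exact) parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the bank frame (assumption)
X, y = make_classification(n_samples=400, n_features=6, weights=[0.9], random_state=7)

param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}
# scoring="recall" because missing a likely buyer (false negative) is the costlier error
grid = GridSearchCV(DecisionTreeClassifier(random_state=7), param_grid,
                    scoring="recall", cv=5)
grid.fit(X, y)
print("Best params:", grid.best_params_)
print(f"Best CV recall: {grid.best_score_:.3f}")
```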

Cost Complexity Pruning

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.
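A sketch of the pruning-path call, assuming synthetic data in place of the training set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Assumed stand-in for the training data
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X, y)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# As alpha grows, more nodes are pruned and the total leaf impurity rises
print("first alphas:", np.round(ccp_alphas[:5], 4))
print("first impurities:", np.round(impurities[:5], 4))
```

Each alpha on the path would then be tried as `ccp_alpha` in a fresh tree, and the one with the best validation recall kept.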

Visualizing the Decision Tree

Misclassified Data Analysis

Conclusion

Recommendations